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Abstract 

Background: Estimating the rate constants of a biochemical reaction system with known stoichiometry from noisy 
time series measurements of molecular concentrations is an important step for building predictive models of 
cellular function. Inference techniques currently available in the literature may produce rate constant values that 
defy necessary constraints imposed by the fundamental laws of thermodynamics. As a result, these techniques may 
lead to biochemical reaction systems whose concentration dynamics could not possibly occur in nature. Therefore, 
development of a thermodynamically consistent approach for estimating the rate constants of a biochemical 
reaction system is highly desirable. 

Results: We introduce a Bayesian analysis approach for computing thermodynamically consistent estimates of the 
rate constants of a closed biochemical reaction system with known stoichiometry given experimental data. Our 
method employs an appropriately designed prior probability density function that effectively integrates 
fundamental biophysical and thermodynamic knowledge into the inference problem. Moreover, it takes into 
account experimental strategies for collecting informative observations of molecular concentrations through 
perturbations. The proposed method employs a maximization-expectation-maximization algorithm that provides 
thermodynamically feasible estimates of the rate constant values and computes appropriate measures of 
estimation accuracy. We demonstrate various aspects of the proposed method on synthetic data obtained by 
simulating a subset of a well-known model of the EGF/ERK signaling pathway, and examine its robustness under 
conditions that violate key assumptions. Software, coded in MATLAB®, which implements all Bayesian analysis 
techniques discussed in this paper, is available free of charge at http://www.cis.jhu.edu/~goutsias/CSS%20lab/ 
software.html. 

Conclusions: Our approach provides an attractive statistical methodology for estimating thermodynamically 
feasible values for the rate constants of a biochemical reaction system from noisy time series observations of 
molecular concentrations obtained through perturbations. The proposed technique is theoretically sound and 
computationally feasible, but restricted to quantitative data obtained from closed biochemical reaction systems. 
This necessitates development of similar techniques for estimating the rate constants of open biochemical reaction 
systems, which are more realistic models of cellular function. 
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Background 

Biochemical reaction systems are popular models of cellu- 
lar function. These models are extensively used to repre- 
sent the inter-connectivity and functional relationships 
among molecular species in cells and, most often, they 
provide accurate description of cellular behavior. Inferring 
a biochemical reaction system from experimental data is 
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an important step towards building mathematical and 
computational tools for the analysis of cellular systems. 
This step requires both structure (stoichiometry) identifi- 
cation as well as parameter (rate constant) estimation 
[1-4]. Due however to the large combinatorial complexity 
of determining the stoichiometry of a biochemical reaction 
system, solving this problem requires large amounts of 
high quality experimental data and substantial computa- 
tional resources, which are not usually available in 
practice. 
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Recently, several approaches have been proposed in 
the literature for addressing a simpler problem, known 
as model calibration. The objective of model calibration 
is to adjust the kinetic parameters of a biochemical reac- 
tion system with given stoichiometry in order to obtain 
a sufficiently good match between simulated and 
observed dynamics; e.g. see [2,5-11]. 

Among known model calibration techniques, the ones 
based on Bayesian analysis [7,10,11] are perhaps the most 
versatile. Bayesian analysis allows us to effectively incor- 
porate biophysical knowledge into the problem at hand 
and naturally draw statistical conclusions about the 
unknown kinetic parameters. This is done by employing 
a probability density function that encapsulates prior 
information about the rate constants of a biochemical 
reaction system and by deriving a posterior probability 
density function over the kinetic parameters after experi- 
mental data have been collected. By taking into account 
the experimental data and the information contained in 
the prior, the posterior density summarizes all knowledge 
available about the unknown kinetic parameters and 
quantifies uncertainty about their true values [12,13]. 
Moreover, the posterior allows us to quantify our confi- 
dence about estimation accuracy, compute probabilities 
over alternative calibrations, and design additional 
experiments to improve inference. 

Most published model calibration techniques do not 
take into account constraints on the reaction rate con- 
stants imposed by the fundamental laws of thermody- 
namics. If these constraints, known as Wegscheider 
conditions [14,15], are not explicitly considered by a 
model calibration technique, then the method will spend 
most time examining impossible kinetic parameter sets 
and will most probably produce a biochemical reaction 
system that is not physically realistic [16]. This issue has 
been recently recognized in the literature, and new mod- 
eling formalisms have been suggested in an effort to 
address it [17-20]. The proposed formalisms describe a 
biochemical reaction system by well-defined thermody- 
namic parameters whose values always guarantee that the 
reaction rate constants satisfy the Wegscheider condi- 
tions. For example, in [19,20], a biochemical reaction sys- 
tem is parameterized in terms of molecular capacities 
and reaction resistances, by using a thermodynamic 
kinetic modeling (TKM) formalism that enjoys a number 
of advantages over the ones suggested in [17,18]. 

We believe that parameterizing a biochemical reaction 
system in terms of capacities and resistances is unneces- 
sary and, in certain instances, problematic. It has been 
pointed out in [19] that different choices for the TKM 
parameters can lead to the same concentration dynamics. 
As a consequence, the TKM parameters cannot be deter- 
mined uniquely from concentration measurements. A 
way to address this problem is to take the capacities to be 



the equilibrium concentrations (which is always possible 
in closed biochemical reaction systems), in which case 
the capacities are constrained by conservation relation- 
ships imposed by the system stoichiometry. Then, para- 
meter estimation in the TKM formalism may be possible 
by arbitrarily fixing a subset of capacity values and esti- 
mating the remaining capacities and resistances. How- 
ever, this approach can be very cumbersome when 
dealing with molecular perturbations (as we do in this 
paper) or when merging estimated TKM models, since, 
in both cases, the capacities may not refer to compatible 
equilibrium concentrations. It has been suggested in [19] 
that a way to merge two models using the TKM formal- 
ism is to first convert the capacities and resistance to the 
rate parameters, merge the two models, and then convert 
back to the TKM formalism. However, this approach 
seems to be overly complicated, especially in view of the 
model calibration methodology presented here. 

In this paper, we introduce a thermodynamically con- 
sistent Bayesian analysis approach to model calibration 
that does not require reparametrization. Our approach 
relies on statistically modeling the reaction rate constants 
of the forward reactions as well as the equilibrium con- 
stants of individual reactions. We restrict our attention 
to closed systems (or systems that can be approximately 
considered to be closed), since thermodynamic analysis 
of such systems is easier to handle than open systems. 
The proposed approach controls thermodynamic consis- 
tency of the reaction rate constants by employing well- 
defined relationships between the kinetic parameters of a 
biochemical reaction system, imposed by the Wegschei- 
der conditions. By embedding these relationships within 
an iterative algorithm that finds the mode of the poster- 
ior density, we arrive at a thermodynamically consistent 
Bayesian estimate for the rate constants. 

Bayesian analysis can be appreciably influenced by the 
choice of the prior probability density functions. This is 
particularly true in systems biology problems in which 
only a small number of observations is usually available. 
It is therefore important to focus our effort on con- 
structing appropriate prior densities for the unknown 
rate constants of the forward reactions and the equili- 
brium constants of individual reactions. Although a 
number of choices may be possible, it is imperative to 
use fundamental biophysical and thermodynamic princi- 
ples to derive informative prior densities that effectively 
encapsulate such principles. 

By using the classical Arrhenius formula of chemical 
kinetics [21], we construct an appropriate prior density 
for the log-rate constants of the forward reactions. To 
do so, we assume that the prefactor and activation 
energy associated with the Arrhenius formula are both 
random variables following log-normal and exponential 
distributions, respectively. This approach takes into 
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account unpredictable changes in biochemical condi- 
tions affecting the structure of the reactant molecules 
and the probability of reaction after collision. On the 
other hand, by exploiting the thermodynamic relation- 
ship between rate constants, equilibrium concentrations, 
and stoichiometric coefficients, we derive an analytical 
expression for the joint prior density of the logarithms 
of the equilibrium constants. This expression depends 
on steady-state concentration measurements and on the 
stoichiometry of the biochemical reaction system under 
consideration. 

Another important issue associated with the inference 
problem considered in this paper is the need to collect an 
informative set of measurements that can lead to suffi- 
ciently accurate parameter estimation. It has been increas- 
ingly recognized in the literature that a powerful approach 
to accomplish this goal in problems of systems biology is 
to selectively perturb key molecular components and mea- 
sure the effects of these perturbations on the underlying 
concentrations [22-24]. We follow this strategy here and 
assume that we can selectively perturb, one at a time, the 
initial concentrations of a selected number of molecular 
species in a biochemical reaction system, by increasing or 
decreasing their values without altering the underlying 
stoichiometry. This can be achieved by a variety of experi- 
mental techniques, such as RNA interference (RNAi), 
transfection, or molecular injection. Therefore, our 
approach combines Bayesian analysis with current experi- 
mental practices, thus bridging the gap between statistical 
inference approaches and experimental design. 

The Bayesian analysis technique discussed in this paper 
requires numerical evaluation of a number of statistical 
summaries of the posterior density. Although several 
methods are available to deal with this problem (e.g., see 
[25,26]), we employ here a maximization-expectation- 
maximization (MEM) strategy that calculates a thermo- 
dynamically consistent estimate of the reaction rate 
constants as well as Monte Carlo estimates of posterior 
summaries used to evaluate the quality of inference. This 
strategy is based on sequentially combining a powerful 
stochastic optimization technique, known as simultaneous 
perturbation stochastic approximation (SPSA) [27], with 
Markov chain Monte Carlo (MCMC) sampling [25]. Our 
experience with extensive synthetic experiments, based on 
data obtained by simulating a subset of a well-known 
model of the EGF/ERK signaling pathway, indicates that 
the proposed algorithm is robust, producing excellent esti- 
mation results even in cases of high measurement errors 
and limited time measurements. 

This paper is structured as follows. In the "Methods" 
section, we provide a brief overview of biochemical reac- 
tion systems, discuss how to model perturbations, and 
present a standard statistical model for the measure- 
ments. We then outline our Bayesian analysis approach 



to model calibration and present our choices for the 
prior and posterior densities. By emphasizing the fact 
that the prior density must assign zero probability over 
the thermodynamically infeasible region of the para- 
meter space, and by employing an encompassing prior 
approach to Bayesian analysis, we derive an appropriate 
posterior density that satisfies this condition. We finally 
outline our proposed methodology for computing 
thermodynamically consistent Bayesian estimates of 
the kinetic parameters and for assessing estimation 
accuracy. 

In the "Results/Discussion" section, we provide simula- 
tion results, based on a subset of a well-established model 
of the EGF/ERK signal transduction pathway. These 
results illustrate key aspects of the proposed model cali- 
bration methodology and show its potential for producing 
sufficiently accurate thermodynamically consistent esti- 
mates of a biochemical reaction system from noisy time- 
series measurements of molecular concentrations. 

Finally, in the "Conclusions" section, we discuss a key 
statistical advantage of the proposed model calibration 
methodology, viewed from a bias-variance tradeoff per- 
spective. Moreover, we provide suggestions for further 
research to address a number of practical issues asso- 
ciated with model calibration, such as estimating the 
initial concentrations and their perturbations, dealing 
with partially observed or missing data, and extending 
the proposed technique to the case of open biochemical 
reaction systems. 

Extensive mathematical and computational details 
are required to rigorously formulate, derive, and 
understand various aspects of the proposed approach. 
We provide these details in three Additional files 
accompanying this paper. In Additional file 1, we pre- 
sent a detailed exposition of the underlying theory, 
whereas, in Additional file 2, we carefully discuss com- 
putational implementation. Finally, in Additional file 3, 
we provide all necessary details pertaining the bio- 
chemical reaction system we use in our simulations. 
Well-documented software, coded in MATLAB 8 , 
which implements all Bayesian analysis techniques dis- 
cussed in this paper, is available to interested readers 
free of charge at http://www.cis.jhu.edu/~goutsias/CSS 
%20lab/software.html. 

Methods 

Biochemical reaction systems 

In this paper, we consider a biochemical reaction system 
comprised of N molecular species X lt X 2 , X N that 
interact through M coupled reactions of the form: 

N N 

2 V ™ X «H=^Z V '""> X "' msM:={l,2 M}. (1) 
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The parameters k 2m -i and k 2m are the rate constants 
of the forward and reverse reactions, whereas, v nm , 
v' nm > Oare the stoichiometries of the reactants and 

products. Note that k 2m ^, k 2m >0, for all m e ..M, since 
irreversible reactions are thermodynamically not possible 
in a closed biochemical reaction system [19]. We will 
assume that the system is well-mixed (homogeneous) 
with constant temperature and volume. We will also 
assume that the molecular concentrations evolve con- 
tinuously as a function of time and that all reactions 
can be sufficiently characterized by the mass action rate 
law. In this case, we can describe the dynamic evolution 
of the molecular concentrations in the system by the 
following chemical kinetic equations: 



dx^\t) _ 
dt 



= X s »™ p ™ )W ' tsT,nsM,psV, ( 2 ) 



initialized by 

4 p) (o) 



c p +n p , 



if n = p * 0 

if p = 0 or n * p * 0, 



(3) 



where p„'(r)is the net flux of the m th reaction at 
time t, given by 



(4) 



s nm is the net stoichiometry coefficient of the « th 
molecular species associated with the m th reaction, 

defined by s nm := v' nm - v nm , and T := [0, £ max ] is an 
observation time window of interest. 

Equations (2)-(4) are based on the assumption that we 
can selectively perturb, one at a time, the concentrations 
of molecular species in a set V, by increasing or 
decreasing their values at time t = 0 without altering the 
underlying stoichiometry. For notational convenience, 
we include 0 in J> and assign p = 0 to the original 

unperturbed system. In this case, x^\t) is the concen- 
tration of the n th molecular species in the unperturbed 
system at time t, whereas, xff'(t) , for p *0, is the con- 
centration of the n th molecular species at time t, 
obtained by perturbing the initial concentration of the 
p th species. In (3), n p >-c p quantifies the perturbation 
applied on the initial concentration c p of the p th molecu- 
lar species at time t = 0. When -c p <n p <0, the initial 
concentration of the p th molecular species is reduced, a 
situation that can be achieved by a variety of experimen- 
tal techniques, such as RNA interference (RNAi). On the 



other hand, when n p >0, the initial concentration of the 
p th molecular species is increased, a situation that can be 
achieved by transfection or molecular injection. 

Due to the enormous complexity of biological reaction 
networks, (1) is used to model a limited number of mole- 
cular interactions embedded within a larger and more 
complex system. Mass flow between the biochemical 
reaction system given by (1) and its surroundings compli- 
cates modeling. As a matter of fact, some molecular con- 
centrations in the system may be influenced by unknown 
reactions, not modeled by (1), or by partially known reac- 
tions with reactants regulated by unknown biochemical 
mechanisms. To address this problem, we will assume 
that there is no appreciable mass transfer between the 
biochemical reaction system and its surroundings during 
the observation time interval T = [0, £ max ]. As a conse- 
quence, we can assume that (1) characterizes a closed 
biochemical reaction system within T- Moreover, we will 
assume that the system reaches quasi- equilibrium at 
some time U < f max > after which its thermodynamic prop- 
erties do not appreciably change for U < t < f max . Note 
however that the quasi-equilibrium assumption does not 
necessarily imply that the biochemical reaction system 
will be at thermodynamic equilibrium after time f max , 
since mass transfer may take place at some time t > £ max . 
Although we may be able to satisfy these assumptions by 
appropriately designed synthetic or in vitro biological 
experiments, the assumptions are certainly not satisfied 
in vivo. For this reason, we believe that future research 
must be focused on extending the approaches and 
techniques discussed in this paper to the case of open 
biochemical reaction systems. 

Measurements 

We will now specify an appropriate model for the avail- 
able measurements. We will assume that, by an appro- 
priately designed experiment, we can obtain noisy 

measurements y := {ySf'(r<j)< n e Af,p& V,cj& Q} and 

y := {ylC\tQ +1 ),ns Af,p& V}of the concentrations of 

all molecular species in the unperturbed and perturbed 
systems at a limited number of distinct time points 
t 1 < t 2 < ... < t Q < t Q+1 in T, where Q := {1, 2, ... , Q}. 
We will also assume that these measurements are 

related to the true concentrations xff^t^) by 

y(f>(g = ln[4 P) (t„)4 P) (g] = ln4 P) (g + 4"^,), 
ne Af,peP, 



(5) 



for q = 1, 2, ... , Q + 1, where e^ p '(r^) is a multiplica- 
tive random error factor and rj^\t q ) := lnetf\t q ) . The 
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assumption of multiplicative errors is common in most 
data acquisition procedures, such as DNA microarray- 
based genomics and mass spectrometry-based proteo- 
mics [28-30], whereas, the logarithm is used to obtain a 
convenient additive error model for the measurements. 

In the following, we will assume that the biochemical 
reaction system, and all its perturbed versions, is suffi- 
ciently close to steady-state at time point t Q+1 . We can 
justify this assumption by taking U < t Q+ i < f max and by 
recalling our previous assumption that the biochemical 
reaction system is at thermodynamic quasi-equilibrium 
at times U < t < f max . Our Bayesian analysis approach is 
based on data y, whereas, we use the steady-state mea- 
surements y to derive a joint probability density func- 
tion for the logarithms {ln(k 2m -i/k 2m ), m e <M\ of the 
equilibrium constants of the reactions needed for speci- 
fying the posterior density. 

Finally, we will assume that the error components 

77 ^(tq) are statistically independent zero-mean Gaussian 

random variables. The Gaussian assumption is quite 
common in genomic problems and has been experimen- 
tally verified in some cases; e.g., see [31]. This assumption 
is usually justified by the central limit theorem and the 
premise that the errors are due to a large number of 
independent multiplicative error sources. We may 
attempt to justify the independence assumption between 
measurement errors by arguing that an error occurred in 
a particular measurement may only be due to the acquisi- 
tion process used to obtain that measurement and, 
hence, it may not affect the error values of other mea- 
surements. In general, however, this is only a mathemati- 
cally convenient assumption that may not be realistic. 
We experimentally demonstrate later that, at least for the 
example considered in this paper, the proposed estima- 
tion methodology is quite effective even in the case of 
non-Gaussian and correlated measurement errors. For 
simplicity, we finally assume equal error variances; i.e., 

we will assume that var^^^t^)] = a 2 , for every n, p, and 

q. This assumption is not crucial to our approach and 
can be relaxed if necessary. 

Bayesian model calibration 

In this paper, we deal with the following problem: Given 
noisy concentration measurements y and y , we want to 
calculate thermodynamically consistent estimates of the 
log-rate constants k := {n 2m -i ■= In k 2m -\, K 2m := In k 2m 
m e of a closed biochemical reaction system, such 
that (2), initialized by (3), produce molecular concentra- 
tions xff-'(t) that "best" match (in some well-defined 
sense) the available measurements. 



We should note here that it is convenient to estimate 
the logarithms of the rate constants instead of the con- 
stants themselves. By focusing on the logarithms, we 
can reduce the dynamic range of rate constant values 
and make their estimation numerically easier. To sim- 
plify our developments, we will assume that the initial 
concentrations {c n , n e W) and perturbations {n p , 
p e V] are known or have been estimated by an appro- 
priate experimental procedure. When this is not true, 
these quantities must be treated as unknown parameters 
and estimated from data, together with the rate con- 
stants, provided that a sufficient amount of data is avail- 
able to allow reliable estimation. 

Given data y, the objective of Bayesian analysis is to eval- 
uate the posterior probability density function p{n \ y), 
which summarizes our belief about the log-rate constants 
k after the data y have been collected. It can be 
shown [see Equations (S-1.4) and (S-1.5) in Additional 
file 1] that 

p(K\Y)~p(y\K)jp(K\z)p{z)dz, (6) 

where p <* q denotes that p is proportional to q, and 

p(y\K) = jp(y\K,CT 2 )p{cT 2 \K)dcT 2 , (7) 

with z = {Zm, m e ,J/\ being the set of log-equilibrium 
constants of the reactions, defined by 

fe 

z m :=\n^^- = K 2m _ 1 -K 2m , formeM (8) 
k 2m 

Note that the prior density of the log-rate constants 
/tdepends on z. For this reason, we view z as a set of 
random hyperparameters (in Bayesian analysis, para- 
meters used to specify prior densities are known as 
hyperparameters), specified by means of the prior den- 
sity p{z). 

The posterior density p(n \ y) takes into account our 
prior belief about the rate constant values and the data 
formation process, summarized by the prior density 
p{z) of the log-equilibrium constants, the conditional 
prior density p(n \ z) of the log-rate constants given 
the log-equilibrium constants, the conditional probabil- 
ity density p(a 2 \ n) of the error variance given the 
log-rate constants, and the likelihood p(y \ n, <J 2 ). 
However, the posterior density is hard to interpret, 
especially in high-dimensional problems that involve 
many parameters, such as the problem we are dealing 
with here. As a consequence, the main objective of 
Bayesian analysis is to produce numerical information 
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that can be effectively used to summarize the posterior 
density and simplify the task of statistical inference to 
the extent possible. Typical summaries include mea- 
sures of location and scale of the posterior, which are 
used to produce estimates for the parameter values and 
to evaluate the accuracy of such estimates, respectively. 

It is clear from (6) that, to evaluate the posterior 
p{n\ y), we need to compute the "effective" prior density 
jp(n\ z)p{z) dz as well as the "effective" likelihood jp(y \ 
n, (fjpio 2 | h^do 2 . To do so, we must specify the afore- 
mentioned densities pio 2 \ k), p(z), p{n\ z), and p{y \ k, 
O 2 ). We discuss this problem next. 

Prior density of error variance 

In general, it is difficult to derive an informative prior 
probability density function p(a 2 \ k) for the error var- 
iance. To deal with this problem, we assume here that the 
error variance is independent of the rate constants; i.e., we 
assume that pio 2 \ n) = pio 2 ). Moreover, we assume that 
O 2 follows an inverse gamma distribution, in which case 



Ha) 



(9) 



for two parameters a, b >0. 

The independence assumption between a 2 and nis 
reasonable, in view of the fact that the errors are mainly 
due to the experimental methodology used to obtain the 
measurements, whereas, the rate constants are due to 
biophysical principles underlying the biochemical reac- 
tion system. On the other hand, the choice given by (9) 
has been well-justified in Bayesian analysis. In fact, the 
inverse gamma distribution is the conjugate prior for 
the variance of additive Gaussian errors [13]. Conjugate 
priors are common in Bayesian analysis, since they often 
lead to attractive analytical and computational simplifi- 
cations. Note that E[<7 2 ] = b/(a - 1) and var[<7 2 ] = 
{E[o 2 ]} 2 /(a - 2) = b 2 /[(a - l) 2 (a - 2)], for a > 2. There- 
fore, the parameters a, b control the location and scale 
of the inverse gamma distribution given by (9). We illus- 
trate this prior in Figure S-1.3 of Additional file 1. In 
the following, we treat a and b as hyperparameters with 
known values. For a practical method to determine 
these values, the reader is referred to Additional file 1. 

Prior density of log-equilibrium constants 

Before we consider the problem of specifying a prior 
density for the log-equilibrium constants z, we first 
investigate how much information about z can be 
extracted from measurements. 

It is a direct consequence of thermodynamic analysis 
that, at steady-state, the net flux of each reaction in a 
closed biochemical reaction system must be zero. This 
implies that 



(10) 



for all m e M, p e V, 



by virtue of (4), where [x^ > 0,ns A/"} are the sta- 
tionary concentrations when the initial concentration of 
the p th molecular species is perturbed (thermodynamic 
analysis dictates that these concentrations must be non- 
zero, provided that the initial concentrations are non- 
zero). As a matter of fact, (10) is equivalent to the 
following constraints on the reaction rate constants (see 
Additional file 1): 



n 

meAi 



Y- 



c 2m 



for all re null(§), (11) 



known as Wegscheider conditions [14,15], where r m is 
the m th element of the M x 1 vector r, g is the N x M 
stoichiometry matrix of the biochemical reaction system 
with elements s nm , and null( § ) is the null space of § . 
As a consequence, for a biochemical reaction system to 
be physically realizable, it is required that the reaction 
rates satisfy the thermodynamically imposed Wegscheider 
conditions. 

From (8) and (10), note that 



P + 



y2,2 S " mlnX " P) ' imanmsM - (12) 



By employing (5) and (12), we can show that 

z m = y m -f/ m ' where 



Ym ■= J, J, s nm y { n p \t Q+1 ) and 

peV neAf 



(13) 



Using this result and some straightforward algebra 
(see Additional file 1), we can show that, given 

y:={y m ,m & M} , which can be calculated from the 

measurements y = {yl?\tQ +1 ), n e Af,p e V} of the 
steady-state molecular concentrations and (13), we can 
construct the posterior density p(z | y) of z by 



p{z | y) • 



2b 
P + l 



+ {z-y) T U-\z-y) 



-{M/2+a) 



, (14) 
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where H is an M x M matrix with elements 

h i = / , „s s , , and a, b are the two hyperpara- 

meters associated with the prior density of the measure- 
ment variance, given by (9). 
The previous result suggests that we may be able to use 

p(z [ y) as an informative prior for the log-equilibrium con- 
stants z; i.e., we may be able to replace p{z) by p(z | y) in 

(6). At a first glance, this idea may not seem appropriate. 
However, it perfectly agrees with the fact that, in Bayesian 
analysis, hyperparameters are often estimated directly from 
data [13]. Since we have shown here that steady-state mea- 
surements can be effectively used to construct the entire 
posterior probability density function of z, it seems reason- 
able to use this posterior as a prior density for z. Note how- 
ever that, by replacing p{z) with p(z | y) in (6), we must 

make sure that y is independent of y (see Additional file 

1). Otherwise, our choice for p(z) may not lead to a proper 
posterior density p{n \ y) (i.e., it may not lead to a density 
that is finite for all y). Note that the independence 

of y and y is assured by the independence between the 
measurement errors {j?if'(tQ+i)' ns N,pe. V) and 

An important observation here is that evaluation of 

p(z I y) > given by (14), may not be possible, since the 

matrix H may not be invertible. We can address this 
problem by decorrelating z using the singular value 
decomposition (SVD) of matrix H. As a consequence, 

we obtain H = U 0 D 0 Ul , where D 0 is an invertible 

diagonal matrix containing the nonzero singular values 

of H, and U 0 is an appropriately constructed matrix 

(see Additional file 1 for details). In this case, instead of 

using (14) for p(z | y) , we must use 



p[z | y) ■ 



2b 
P + l 



+ {z-yyU 0 B 0 1 U T 0 {z-y) 



,(15) 



which we can always evaluate, since matrix D 0 is 
invertible. 

Prior density of log-rate constants 

To specify the (conditional) prior density p(n \ z) of the 
log-rate constants of a biochemical reaction system, we will 
first derive a prior probability density function p(K 2m -i) for 
the log-rate constant of the m th forward reaction. To do so, 
we use the well-known Arrhenius formula of chemical 



kinetics [21] and set k 2m -i = <Xm Gxp{-E m /k B 1), where a m is 
the prefactor, E m is the activation energy of the reaction, k B 
is the Boltzmann constant (k B = 1.3806504 xlO" 23 J/K), and 
T is the temperature. Unfortunately, we cannot predict the 
values of the prefactor and activation energy precisely. To 

deal with this problem, we set a m = a ° exp{g m } and 

E m = E° m +U m , where a ° , E° m are the predictable por- 
tions of the prefactor and activation energy, respectively, 
and g m , U m are two random variables that model the 
unpredictable portions of these quantities. In the Addi- 
tional file 1, we argue that it is reasonable to model g m as a 
zero-mean Gaussian random variable with standard devia- 
tion X m , and U m is an exponential random variable with 

mean and standard deviation k B T^ , where T* m is a tem- 
perature larger than T. As a consequence, we obtain the 
following prior density for the log-rate constant K 2m -i of 
the m th forward reaction [see Equation (S-1.31) in Addi- 
tional file 1]: 



P{. K 2m-\) = 



2t„ 



erfc 



V2 



.0 \ 



X„ 



(16) 



,( lc 2m-l K m)hm 



where r m ■= T* m / T > 1 , K° m := lna° - E° m / k B T , and 
erfc[-] is the complementary error function. We illus- 
trate this prior in Figure S-l.l of Additional file 1. 

Basic thermodynamic arguments (see Additional file 1) 
imply that z m , defined by (8), is a constant characteristic 
to the m th reaction. Since n 2m = K 2m -i - z m , this implies 
that the rate constants K 2m and K 2m _i are two dependent 
random variables, given z m , with joint probability density 
I Zm) = $(k 2 + z m )p{K 2m ^), where 

S(-) is the Dirac delta function [see Equation (S-1.37) in 
Additional file 1], By assuming that the reaction rate 
constants of different reactions are mutually indepen- 
dent given the z's (which is reasonable if we assume that 
all common factors affecting these rates, such as tem- 
perature and pressure, are kept fixed), we obtain 



(17) 



meM 



Equations (16) and (17) provide an analytical form for 
the prior density of the log-rate constants. To use this 
expression, we must determine appropriate values for 

<p := {fc°,T ffl ,l ffl ,ra6 M}> which can be treated as 

hyperparameters. Although we could treat <p as random, 
we will choose here known values for these parameters. 
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This is motivated by the fact that (p determines the loca- 
tion and scale of the prior densities of the forward rate 
constants; see Figure S-l.l in Additional file 1. In cer- 
tain problems of interest, there might be enough infor- 
mation to determine possible ranges for the forward 
rate constant values. As a consequence, we can use this 
information, together with an appropriate procedure, to 
effectively determine values for (p. The reader is referred 
to Additional file 1 for details on how to do so. 

Effective likelihood 

Calculating the effective likelihood piy \ k), given by (7), 
is straightforward. From (5), (7), and (9), we can show 
that 

P(Y I *) « J CTN(P+1) 1 Q+2(a+ D exp{-^ }da 2 , (18) 
where 



cp(K, y) := 2b + ^ J S [Y « ][t l ] ~ ln x n%)] 2 - (19) 

neAf peV q&Q 



By setting f= 0(/c,y)/2o 2 in (18), we obtain 

p(y | r) - [«p( IC/ y)]- a - N ( p+1 W/ 2 J^ +N ( p+1 W/ 2 - 1 e -«^ 
oc[«p( K ,y)]-«- N ( p+1 W/ 2 . 



(20) 



Note that evaluating 4>{K,y) at given values of k and y 
requires integration of the system of ordinary differential 
equations (2). 

Posterior density 

Our previous developments lead finally to an analytical 
formula for the posterior density p{n\ y) of the log-rate 
constants. Indeed, (6), (15), (17), (19), and (20), lead to 



P(k I y) • 



with 



(o(k) 



(21) 



<o(r) = TTe* -J-i^L + ^—^L 



Y) = J—^ + X X e """'C^2 m -l " «2m ~ Ym)(« 2m'-l " «2m' ~ Ym) ^2) 
me.M m'eM * ' 

<p(k, y) = 2b + Y^ X5>i, p) (g - ln4")(t,)] 2 
/3 = a + N(P + 1)Q / 2, 

where Q mm ' are the elements of matrix 
UqD^Uq obtained from the SVD decomposition of 
§ T S' and y m is given by (13). 



Note that the posterior density of the log-rate con- 
stants is a compromise between the prior and the likeli- 
hood. The prior terms o)(k) and i//{k, y) penalize log- 
rate values that do not fit well with available a-priori 
information, whereas, the likelihood term <j)(n, y) pena- 
lizes log-rate values that produce concentration 
dynamics which deviate appreciably from measurements. 
As the number N(P + 1)Q of available measurements 
increases, this compromise is controlled to a greater 
extent by the data through the factor <f>(n, y). 

A problem arises with the posterior density p{n \ y), 
given by (21) and (22), since nonzero probabilities may 
be assigned to thermodynamically infeasible log-rate 
constants. A Bayesian analyst might argue that we have 
correctly done our job by formulating the problem as 
we did and that it is the data which will rule out the 
possibility that our biochemical reaction system can be 
characterized by thermodynamically infeasible para- 
meters. However, we choose to trust thermodynamics 
far more than we would trust noisy data and appropri- 
ately modify the posterior density based on our know- 
ledge that the kinetic parameters must satisfy the 
Wegscheider conditions given by (11). 

By using the Wegscheider conditions, we can decom- 
pose the 2M log-rate constants k into two mutually 
exclusive sets: M + M 1 "free" log-rate constants Kj and 
M -Mi "dependent" log-rate constants K d , where 
M 1 = rank(S)(see Additional file 1). Although para- 
meters Kf can take any value, parameters must be 
equal to Wfc^for the Wegscheider conditions to be 
satisfied, where W is an appropriately defined matrix. 
One way to incorporate the constraint K d = WKf into 

our Bayesian analysis problem is to treat it as prior 
information and apply it on the prior density of the 
unconstrained problem. This principle forms the basis 
of an attractive strategy for incorporating constraints 
into Bayesian analysis, known as encompassing prior 
approach (EPA) [32]. By following EPA, we can replace 
the previously discussed encompassing "effective" prior 
density \p{n \ z)p{z)dz by the following probability 
density function: 



Pw(. K f' K d) ■ 



S(K d -WK f )jp(Kf,K d \ Z ) P (z)dz 
jjjs(K,,-WK f )p(K f ,K d j z)p{z)dzdK l dK d 



(23) 



where 3 is the Dirac delta function. Clearly, this den- 
sity assigns zero probability to kinetic parameters that 
do not satisfy the Wegscheider conditions, since 

8{K d -Wfc / ) = 0, if K d ±Wic f . Note now that the 

log-rate constants Kd are of no immediate interest, since 
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their values can be determined as soon as the values of 
the log-rate constants Kyhave been estimated. As a con- 
sequence, we can treat Kd as "nuisance" parameters and 
integrate them out of the problem [13]. This integration, 
together with the updated prior density we presented 
above, leads to the following marginal posterior density 
of the log-rate constants Kf : 

PjKf I y) - js(K d -WK f ) P (K f ,K d | Y )dK i = P (K f ,WK f | Y ). (24) 

Clearly, the values of the marginal posterior p w (kj \ y) 
are proportional to the corresponding values of the origi- 
nal posterior density p(nj , K d \ y) over the thermodyna- 
mically feasible region of the parameter space, given by 

the hyperplane K d = Wk^ . In the following, we will base 

our Bayesian analysis approach on p w («y | y). 

Computing the posterior mode 

In a Bayesian setting, we use the location of the posterior 
density over the parameter space to provide an estimate of 
the unknown parameter values. Typically, two measures of 
location are employed, namely the mode and the mean of 
the posterior. The posterior mean minimizes the mean- 
square error between the estimated and true parameters, 
whereas, the posterior mode is more likely to produce 
dynamics that closely resemble the true dynamics (see 
Additional file 1 for why this is true). We note here that 
the main objective of parameter estimation in biochemical 
reaction systems is not necessarily to determine parameter 
values that are "close" to the true values (e.g., in the mean 
square sense) but to obtain appropriate values for the rate 
constants so that the resulting molecular concentration 
dynamics closely reproduce the dynamics observed in the 
true system [33]. As a consequence, we choose the poster- 
ior mode as our parameter estimator. 

The posterior log-density \np w (kj \ y) is usually not 
concave, especially when a limited amount of highly 
noisy data y is available. As a consequence, there is no 
optimization algorithm that can find the posterior mode 
in a finite number of steps. A method to address this 
problem would be to randomly sample the parameter 
space at a predefined (and usually large) number of 
points and use these points to initialize an optimization 
algorithm, such as the simultaneous perturbation sto- 
chastic approximation (SPSA) algorithm discussed in 
the Additional file 2. We can then calculate the para- 
meters and the associated values of \np w («y | y) 
obtained by each initialization after a set number of 
optimization steps, and declare the parameters asso- 
ciated with the highest log-posterior value as being the 
desired mode estimates. 

Unfortunately, SPSA (and as a matter of fact any other 
appropriate optimization algorithm) is computationally 



costly, especially in the case of large biochemical reac- 
tion systems. Therefore, using SPSA in the previous 
multi-seed strategy may result in a computationally pro- 
hibitive approach for finding the posterior mode. In 
order to reduce computations, we may choose only a 
small number of initial points that we believe are suffi- 
ciently proximal to the posterior mode. Two such points 
might be the prior and posterior means. As a matter of 
fact, as the data sample size tends to infinity, we expect 
that the posterior mean will coincide with the posterior 
mode, since, under suitable regularity conditions, the 
posterior density converges asymptotically to a Gaussian 
distribution [12,34]. This simple idea leads to the 
sequential maximization-expectation-maximization 
(MEM) algorithm we discuss in the Additional file 2. 
According to this algorithm, we perform a relatively 
small number of SPSA iterations, initialized by the prior 

mode, to obtain a posterior mode estimate ^™° de • We 
then use an MCMC algorithm, initialized by je™° de , to 

obtain an estimate of the posterior mean ^.™ ean . Subse- 
quently, we perform another set of SPSA iterations, 
initialized by ^™ ean , to obtain the posterior mode esti- 
mate K™2 d£ - We finally set ^ ode to be the log-rate 

constants that produce the maximum posterior value 
during all SPSA and MCMC iterations, and set the opti- 
mal estimate ^ of the log-rate constants k equal to 

~ mode ~ mode 
{Kf ,WKf }• 

Estimation accuracy 

One way to quantify the accuracy of the posterior mode 
estimate ^™ ode of a "free" log-rate constant Kyis to cal- 
culate and report the root mean square error (RMSE), 
given by 



r /2 

j( K f-K?° de ) 2 Pwi K f \ Y)d K f | 



(25) 



A small value of £rmse provides us with confidence 
that the estimated value of that constant is accurate. On 
the other hand, the estimate may be perceived as inac- 
curate if £rmse is exceedingly large. 

Another useful metric for evaluating estimation accu- 
racy is D := lndet[V] / (M + MJ , where det[V]is the 
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determinant of the posterior covariance matrix 



= E 



- mode ~ mode n 

[K f -Kf )[K f -Kf ) 



Note that D is 



the corresponding true integrated responses 
x^\t)dt,ne. Af,p& V) . Normalization is required 



the average of the log-eigenvalues of V and is related to 
the well-known D-optimal criterion used in experimen- 
tal design [27]. We can use D to quantify the overall 
accuracy of a model calibration result, with smaller 
values of D indicating better overall accuracy. 

Note that the RMSE's 6rmse can be computed from 
the diagonal elements of V . It turns out the we can 

approximate £rmse and D from an estimate yof the 
posterior covariance matrix V obtained during the sec- 
ond (MCMC) phase of the proposed MEM algorithm 
(see Additional file 2for details). 

When the true values K true of the log-rate constants 
are known (which is the case when we use simulated 
data to evaluate the performance of the proposed Baye- 
sian analysis approach, as we do in this paper), we can 
provide a more direct evaluation of estimation perfor- 
mance. As we have mentioned previously, calculating a 
measure of "closeness" (such as the square error) 
between the estimated and true parameter values may 
not be quite appropriate here. Since, in reality, our 
objective is to estimate the rate constant values so that 
the biochemical reaction system produces dynamics that 
closely match the true molecular dynamics, it may be 
more appropriate to use, as measures of estimation per- 
formance, the following median and maximum absolute 
error criteria: 



e MED-AE - m r e< i 
neAf.peV 



£ max ae - roax 

neN.peV 



JJi ( „ p) (t)-4 p >(t) 



dt 




(26) 



where {x [ n p \t),ts T) and (x^\t),t& T} are th e true 

and estimated dynamics of the « th molecular species 
under the p th perturbation, produced by the biochemical 
reaction system with log-rate constants K true and 

k = {jc/ 0de ,Wic/ 0de }' respectively. Clearly, e M ed-ae 

and Emax-ae provide measures of closeness between the 
estimated molecular responses 

{xn\t),t& T,n& H, pe V} and the true molecular 

responses {x|, p) (r),te T, n e Af,p e V} , normalized by 



i 



in order to make sure that no one species dominates 
the error values more than any other. Finally, note that 
half of the normalized absolute errors will be between 0 
and £med-ae> whereas, the remaining half will be 
between £ MED -ae and Cmax-ae- 

Results/Discussion 

To illustrate key aspects of the previous Bayesian analy- 
sis methodology, we now consider a numerical example 
based on a subset of a well-established model of the 
EGF/ERK signal transduction pathway proposed by 
Schoeberl et al. [35]. This model corresponds to an 
open biochemical reaction system, since it contains irre- 
versible reactions as well as reactions governed by 
Michaelis-Menten kinetics that involve molecular spe- 
cies not included in the model. We extract a closed sub- 
set of the Schoeberl model by choosing the largest 
connected section that contains only reversible reactions 
governed by mass action kinetics. The resulting bio- 
chemical reaction system is depicted in Figure 1 and is 
comprised of N = 13 molecular species that interact 
through M = 9 reversible reactions. Of course, we could 
attempt to generate a closed biochemical reaction sys- 
tem for the entire EGF/ERK signaling pathway, by 
including all relevant molecular species not considered 
by the Schoeberl model (e.g., ADP, ATP, intermediate 
forms in catalyzed reactions, etc.). However, since we 
are only interested in demonstrating the potential and 
key properties of our Bayesian analysis methodology, we 
found this to be unnecessary. We feel that the biochem- 
ical reaction system depicted in Figure 1 leads to a suffi- 
ciently rich numerical example that serves the main 
purpose of this section well. 

In specifying the model depicted in Figure 1, we must 
provide three sets of physically reasonable values: true 
rate constant values, initial concentrations, and experi- 
mentally feasible perturbations to the initial concentra- 
tions. Published values for the reaction rate constants 
associated with our example are given in Equation 
(S-3.1) of Additional file 3. However, these values do 
not correspond to a thermodynamically feasible bio- 
chemical reaction system, since they do not satisfy the 
Wegscheider conditions, given by (11). We should point 
out here that this is a common problem in systems biol- 
ogy. Reaction rate values are usually amalgamated from 
various independent sources in the literature, so it is 
highly unlikely that these values will correspond to a 
thermodynamically feasible biochemical reaction system. 
As a consequence, it is desirable to develop a method 
that uses published values for the reaction rate constants 
and calculates an appropriate set of thermodynamically 
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(EGF-EGFR*)2-GAP (x 7 ) 



Grb2 (x 2 ) 



(EGF-EGFR*)2-GAP-Grb2 (x s ) 



t,k, 




Ras-GTP* (x l2 ) 



Shc*-Grb2-Sos A 

(* 5 ) 



► (EGF-EGFR*)2-GAP-Grb2-Sos 

► Ras-GDP (x l0 ) 



(EGF-EGFR*)2-GAP-Grb2-Sos-Ras-GTP 
(EGF-EGFR*)2-GAP-Grb2-Sos-Ras-GDP Ob) 




Figure 1 A subset of the EGF/ERK signal transduction pathway model proposed in [35], The biochemical reaction system is comprised of 
N = 13 molecular species that interact through M = 9 reactions. Bayesian analysis is focused on estimating the values of the 18 rate constants 
associated with the reactions. 



feasible values that can be considered as the "true" para- 
meter values. In Additional file 3, we calculate "true" 
values for the log-rate constants by using a linear least 
squares approach to project the published values onto 
the thermodynamically feasible hyperplane. The result- 
ing "true" values are given in Equation (S-3.3) of Addi- 
tional file 3. 

Regarding the initial concentrations, we use the values 
specified in [35,36], with two minor modifications. First, 
molecular species with zero initial concentrations are 
modified to have a small number of molecules present. 
We do this to accommodate the fact that, in a real cel- 
lular system, these molecular species are constitutively 
expressed. The second modification comes from the fact 
that we are no longer modeling the entire EGF/ERK sig- 
naling cascade and, therefore, we must account for the 
upstream EGF stimulus. To take this into account, we 
increase the initial concentration of the most upstream 
molecular species in our model, namely (EGF-EGFR*) 
2-GAP. The initial concentrations used are given by 
Equation (S-3.4) in Additional file 3. 

To specify appropriate perturbations to the initial 
molecular concentrations, note that molecular com- 
plexes, such as dimers, trimers, etc., are far more diffi- 
cult to perturb than simple monomeric molecular 
species. For this reason, we focus our perturbation 
efforts on She", Grb2, and Sos. Since She* is 



commercially available in a purified and quantified form, 
we will assume that we can increase its initial concen- 
tration by a factor of 100 using molecular injection. We 
will also assume that we can perturb Grb2 and Sos by 
RNAi, resulting in a decrease in their initial concentra- 
tions by a factor of 100. Thus, we set n\ = 99ci, n 2 = 
-.99c 2 , and n 4 = -.99c 4 . 

To avoid specifying different hyperparameter values 
for the prior densities of the forward log-rate constants, 
we assume here that all densities share the same known 
values t, X], where k = -5.1010, r= 1.8990, and A= 
0.7409, whereas, we set CC= 3 and b = 1 for the hyper- 
parameters of the prior density of the variance cr 2 of the 
measurement errors. These choices correspond to the 
prior densities depicted in Figure S-1.2(a) and Figure 
S-1.3(a) in Additional file 1. We implement our Bayesian 
analysis approach using the MEM algorithm described 
in Additional file 2, with / = 5,000 SPSA iterations in 
each maximization step and a total of L = 50,000 
MCMC iterations in the expectation step. Finally, we 
observe the biochemical reaction system within a time 
period of 1 min. 

In Figure 2, we depict a typical result obtained by the 
proposed Bayesian analysis algorithm. In this figure, we 
compare the estimated log-rate values (blue) with the 
thermodynamically consistent true log-rate values (red) 
as well as the corresponding concentration dynamics of 
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Figure 2 True (red) vs. estimated (blue) log-rate values and selected molecular dynamics in the unperturbed biochemical reaction 
system depicted in Figure 1. The results are based on measuring the dynamics in the unperturbed and perturbed systems at Q = 6 
logarithmically-spaced time points (green circles). Perturbations are applied on the initial concentrations of She*, Grb2, and Sos, one at a time. 
The measurement errors are i.i.d. zero-mean Gaussian with standard deviation o~= 0.3. 
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Table 1 Estimated posterior RMSE values for the case of i.i.d. zero-mean Gaussian errors with standard deviation 
a = 0.3. Logarithmic sampling is used with Q = 6 

Kl K 3 K 5 K 7 Kg K nn K n3 K n5 K 17 

0.2414 0.1578 0.1838 0.2950 0.1426 0.1683 0.0968 0.4474 0.1484 

K2 K>4 K 6 Kg K]0 K>\7 Kl4 K 16 ^18 

0.2594 0.2095 0.1704 0.2124 0.2136 0.5093 0.0494 
The log-rate constants k 8 and k 14 are "dependent'Variables. Therefore, no RMSE values are reported for these variables. 



selected molecular species in the unperturbed biochem- 
ical reaction system. We have obtained these results by 
measuring the concentration dynamics in the unper- 
turbed and perturbed systems at Q = 6 logarithmically- 
spaced time points (green circles), with the measure- 
ments being corrupted by independent and identically 
distributed (i.i.d.) zero-mean Gaussian noise with stan- 
dard deviation <7= 0.3. Moreover, we summarize the 
estimated posterior RMSE values, given by (25), in 
Table 1. Finally, the calculated median and maximum 
absolute error values, given by (26), are 3.03 xlO' 2 and 
1.68 xlO 1 , respectively. 

The concentration dynamics produced by the estimated 
rate constant values match well the dynamics produced 
by the true values. As a matter of fact, the calculated 
median and maximum absolute error values imply that 
half of the relative integrated absolute error values 
between the estimated and true concentration dynamics 
(across all molecular species and all applied perturba- 
tions) are smaller than 3.03%, whereas, the remaining 
values are between 3.03% and 16.8%. On the other 
hand, the estimated posterior RMSE values summarized 
in Table 1 indicate a high probability that, given the 
concentration measurements, the log-rate values will lie 
within a relatively small region around the correspond- 
ing posterior mode values. 

We expect that, in general, by selecting appropriate 
perturbations and by increasing the number of concen- 
tration data collected during an experiment, we can 
improve estimation accuracy. However, how can one 
know if the right perturbations have been applied on 
the biochemical reaction system and if enough data has 
been collected in a practical situation? Inspection of 
RMSE values can provide an answer to these important 
questions. If the estimated RMSE values of the log-rate 
constants of many reactions are large, it may be worth 
collecting additional data by increasing P and Q. Addi- 
tional data can improve estimation accuracy by shrink- 
ing the RMSE values to a size that indicates an 
acceptable degree of uncertainty. However, if the bio- 
chemical reaction system is insensitive to a given kinetic 
parameter, then the RMSE associated with that reaction 
may remain large even as the quality of data improves. 
Therefore, additional data should only be collected 
when the RMSE values are large and sensitivity analysis 



indicates that the values of the rate constants associated 
with these RMSE values appreciably affect the system 
dynamics. 

The RMSE values do not provide a global measure of 
estimation accuracy, since some parameters may have 
small RMSE values and some may have large values. To 
address this problem, we may instead employ the 
D-optimal criterion as a measure of estimation accuracy. 
As a matter of fact, we can effectively use the D-optimal 
criterion as a guide for selecting appropriate perturba- 
tions and for determining the data sampling scheme we 
must use in order to increase estimation accuracy. In 
Table 2, for example, we summarize estimated values of 
D, for the case of uniform and logarithmic sampling, 
calculated for different values of Q. Clearly, the sampling 
scheme used may appreciably affect estimation perfor- 
mance. For each value of Q, uniform sampling results in 
higher values of D than logarithmic sampling. As a con- 
sequence, we must use logarithmic sampling over uni- 
form sampling, since the former may produce better 
estimation accuracy than the latter. This is expected, 
since uniform sampling may result in measuring steady- 
state concentrations much more often than (short-lived) 
transient concentrations. On the other hand, logarithmic 
sampling may be used to gather valuable information 
about the transient behavior of a biochemical reaction 
system while placing less emphasis on its steady-state 
dynamics (which only provide information about the 
equilibrium constants of the underlying reactions). The 
results depicted in Table 2 also suggest an appropriate 
value for Q. If our goal is to find the smallest value Q* 
of Q (an objective dictated by the high cost of experi- 
mentally measuring molecular concentrations) which 
results in a value of D that is no less than, say 5%, of 



Table 2 Estimated values of the D-optimal criterion for 
uniform and logarithmic sampling schemes 



Q 


uniform 


logarithmic 


% change 


2 


-1.7697 


-2.3500 




3 


-2.0030 


-3.4287 


45.90% 


4 


-2.3752 


-3.7432 


9.17% 


5 


-2.6115 


-4.1173 


9.99% 


6 


-2.3492 


-4.1039 


-0.33% 


The measurement errors are i.i.d. 
<y= 0.3. 


zero-mean Gaussian with standard deviation 
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Table 3 Estimated values of the D-optimal criterion for 


different replications and perturbations 




Perturbation 


0 


NO: 1 replication 


-3.0123 


NO: 2 replications 


-3.4950 


NO: 3 replications 


-3.7544 


YES: She* 


-3.1398 


YES: Grb2 


-3.0747 


YES: Sos 


-3.4531 


YES: She*, Grb2 


-3.9279 


YES: She*, Sos 


-3.7716 


YES: Grb2, Sos 


-3.6363 


YES: She*, Grb2, Sos 


-4.1039 



The measurement errors are i.i.d. zero-mean Gaussian with standard deviation 
<7= 0.3. 

Logarithmic sampling is used with Q = 6. 



the value obtained when Q = Q* - 1, then we must set 

Q* = 6. 

In Table 3, we summarize the estimated values of D 
obtained from seven different perturbation experiments 
(logarithmic sampling is used with Q = 6). Moreover, 
we report the D values obtained by repeating an experi- 
ment that does not use molecular perturbations. Experi- 
mental replication may be an effective approach to 
obtain additional data, especially when molecular pertur- 
bations are costly or difficult to apply. Our formulation 
allows us to consider this scenario by setting n p = 0, for 
every p e V- The data collected this way correspond to 
repeating the same experiment P + 1 times, where P is 
the number of elements in V- The maximum experi- 
mental replication considered in Table 3 uses P = 3, 
which corresponds to repeating the same experiment 
four times. This produces the same amount of data as 
the data obtained by perturbing the initial concentra- 
tions of She*, Grb2, and Sos, one at a time. The values 
depicted in Table 3 suggest that perturbing the initial 
concentrations of She*, Grb2, and Sos may be the right 
thing to do, since this produces the lowest value of D 
and, thus, it may result in better estimation performance 
as compared to perturbing the initial concentrations 
of one or two of these molecular species. In this case, 
however, it may also be acceptable to replicate an 
experiment that does not use molecular perturbations, 
since the minimum value of D is only 9.31% lower than 
the D value obtained by repeating the experiment four 
times. 

One of the underlying assumptions associated with the 
proposed Bayesian analysis algorithm is that the mea- 
surement errors are statistically independent, following a 
zero-mean Gaussian distribution with standard deviation 
a. To assess the adequacy of this assumption and 



Table 4 Median and maximum absolute error values 
under a variety of measurement error conditions 



mean = 0 


i.i.d. Gaussian 


i.i.d. Uniform 


correlated Gaussian 


o~= 0.1 


3.98 x10~ 3 


8.64 x10~ 3 


1.48 x10~ 2 




5.56 x10~ 2 


4.81 x10~ 2 


7.32 X10~ 2 


a= 0.2 


1.01 x10~ 2 


1.78 x10~ 2 


3.09 X10~ 2 




8.29 x10~ 2 


1.30 X10" 1 


1.89 X10" 1 


a= 0.3 


3.03 x10~ 2 


1.78 x10~ 2 


3.05 x10~ 2 




1.68 X10" 1 


1.30 X10" 1 


2.46 X10" 1 


a= 04 


2.19 x10~ 2 


2.56 x10~ 2 


1.04 X10" 1 




2.27 X10" 1 


1.41 X10" 1 


3.67 X10" 1 


a= 0.5 


2.67 x10~ 2 


3.86 x10~ 2 


6.43 x10~ 2 




2.48 X10" 1 


3.32 X10" 1 


3.10 x10 _1 



Logarithmic sampling is used with Q = 6. 



evaluate its implication on estimation performance, we 
depict in Table 4 calculated median and maximum 
absolute error values obtained when the measurement 

errors jj^in (5) are i.i.d. zero-mean Gaussian with 

standard deviation <7, i.i.d. zero-mean uniform within 

the interval [-V3cr, >/3ct] , with standard deviation a, 

and correlated zero-mean stationary Gaussian with auto- 

correlation EfoWfofo W(t 2 )] = a 2 exp{- | h - t 2 |} • 

We consider different values for the standard deviation, 
namely <7= 0.1, 0.2, 0.3, 0.4, 0.5, and measure the con- 
centration dynamics in the unperturbed and perturbed 
systems at Q = 6 logarithmically spaced time points. 
Table 4 shows clearly that violation of the i.i.d. Gaussian 
assumption may lead to reduction in estimation accu- 
racy, especially when the measurement errors are corre- 
lated, due to an increase in the maximum absolute error 
values. However, the calculated median absolute error 
values indicate that the proposed algorithm is relatively 
robust to the statistical behavior of the measurement 
errors, producing reasonable estimates for at least half 
of the concentration dynamics. In Figure 3, we depict 
results obtained by the proposed Bayesian analysis algo- 
rithm when measuring the concentration dynamics in 
the unperturbed and perturbed systems at Q = 6 loga- 
rithmically-spaced time points (green circles), with the 
measurements being corrupted by correlated zero-mean 
stationary Gaussian errors with standard deviation o= 
0.3. These results compare favorably to the ones 
depicted in Figure 2. In this case, the calculated median 
absolute error value is 1.48 xlO" 2 , which is 62.8% smal- 
ler that the value obtained when the errors are i.i.d. 
zero-mean Gaussian, whereas, the calculated maximum 
absolute error value is 7.32 xlO' 2 , which is 31.7% larger 
that the value obtained when the errors are i.i.d. zero- 
mean Gaussian. 
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Figure 3 True (red) vs. estimated (blue) log-rate values and selected molecular dynamics in the unperturbed biochemical reaction 
system depicted in Figure 1. The results are based on measuring the dynamics in the unperturbed and perturbed systems at 0 = 6 
logarithmically-spaced time points (green circles). Perturbations are applied on the initial concentrations of She*, Grb2, and Sos, one at a time. 
The measurement errors are correlated zero-mean Gaussian with standard deviation o"= 0.3. 
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Conclusions 

In this paper, we have introduced a novel Bayesian ana- 
lysis technique for estimating the kinetic parameters 
(rate constants) of a closed biochemical reaction system 
from time measurements of noisy concentration 
dynamics. The proposed procedure enjoys a clear advan- 
tage over other published estimation techniques: the 
estimated kinetic parameters satisfy the Wegscheider 
conditions imposed by the fundamental laws of thermo- 
dynamics. As a consequence, it always leads to physi- 
cally plausible biochemical reaction systems. 

From a statistical perspective, there are additional 
advantages for thermodynamically restricting the kinetic 
parameters of a biochemical reaction system to satisfy 
the Wegscheider conditions. This may be seen through 
the well-known bias-variance tradeoff in estimation [27]. 
The mean squared error of a given estimator can be 
decomposed into a bias term and a variance term. In 
general, imposing constraints on the estimator may 
increase its bias but decrease its variance (hence the tra- 
deoff). However, if the true parameter values satisfy the 
constraints, then the variance may decrease without 
increasing the bias term [27]. Since the true values of 
the kinetic parameters must lie on the thermodynami- 
cally feasible manifold in the parameter space, confining 
the Bayesian estimator to this manifold (which is of 
lower dimension than the parameter space itself) may 
lead to lower mean squared error due to a smaller var- 
iance. Since the thermodynamically feasible manifold is 
of lower dimension than the parameter space, gains in 
variance (and hence improvements in the mean squared 
error) are expected to be large. This may be seen 
through the "curse of dimensionality," which refers to 
the exponential increase in the volume of the parameter 
space as its dimension grows, making estimation expo- 
nentially harder in higher dimensional spaces (in our 
example, the unconstrained parameter space has 12.5% 
more dimensions than the thermodynamically feasible 
subspace). The Wegscheider conditions reduce the 
dimensionality of the parameter space to a feasible 
region in which estimation may be easier. Thus, the pro- 
posed Bayesian analysis procedure improves on other 
estimation techniques by producing a statistically super- 
ior, physically meaningful and plausible estimate for the 
kinetic parameters of a closed biochemical reaction 
system. 

The Bayesian analysis methodology discussed in this 
paper has been formulated by assuming that all initial 
concentrations and perturbations are precisely known 
and that concentration measurements can be obtained 
by directly sampling all system dynamics. However, cur- 
rent experimental practices in quantitative systems biol- 
ogy restrict the amount and type of data that can be 



collected from living cells. As a consequence, further 
research is needed to develop approaches that can 
accommodate this important issue and make a Bayesian 
analysis approach to parameter estimation better applic- 
able to systems biology problems. 

If the initial concentrations and the perturbations 
applied on these concentrations are not known, then we 
may try to estimate them together with the unknown 
kinetic parameters. Although formulation of this problem 
is similar to the one considered in this paper, the addi- 
tional computational burden will be substantial. More- 
over, while quantitative biochemical techniques are 
improving, the vast majority of data available in problems 
of systems biology are obtained by measuring ratios of 
molecular concentrations (e.g., by using techniques such 
as SILAC [37]). Estimation of the rate constants of a bio- 
chemical reaction system from concentration measure- 
ments available as ratios relative to a reference system 
requires special consideration and extensive modification 
of the proposed Bayesian analysis procedure. Finally, it is 
very important to address the problem of missing obser- 
vations. This is a common problem in systems biology, 
since it is not possible to monitor and measure the con- 
centrations of all molecular species present in the system. 
Although appropriate modifications to the proposed 
algorithm can lead to a Bayesian analysis approach that 
can handle missing data, we think that development of a 
practically effective way to address this problem is chal- 
lenging. Our future plan is to expand and improve the 
Bayesian analysis procedure discussed in this paper in 
order to provide practical solutions to the previous 
problems. 

It is worth noting here that the estimation procedure 
suggested in this paper applies only to closed biochem- 
ical reaction systems (or to approximations of closed 
systems embedded in a larger open system). However, a 
cell is an open system, since it effectively interacts with 
its environment. If we include the cell's environment 
into our system and monitor the combined system until 
steady-state (i.e., until cell death), then we would have 
the necessary closed system. Unfortunately, this is 
clearly an unrealistic scenario. As a consequence, there 
is also a need to develop a theoretical and computa- 
tional approach for dealing with thermodynamically 
consistent parameter estimation in open biochemical 
reaction systems. 

To conclude, it has been argued in a recent paper [33] 
that most models of computational systems biology are 
"sloppy," in the sense that many parameters of such 
models do not appreciably alter system behavior. A key 
conclusion of this paper is that collective fitting proce- 
dures (such as the Bayesian analysis technique presented 
in the present paper) are far more desirable than 
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piecewise construction of a biochemical reaction system 
model from individual parameter estimates (which is 
how most models are constructed when investigators 
scour the literature for individual rate constant values). 
Moreover, it has been pointed out in [33] that using a 
method to obtain precise parameter values may be diffi- 
cult, even with an unlimited amount of data, since the 
behavior of a sloppy model is insensitive to the values of 
most parameters. As a consequence, the authors suggest 
that, instead of focusing on the quality of parameter 
estimation, it will be more wise to focus on the quality 
of prediction achieved by an estimated model (as we 
have also argued in this paper). 

To a certain extent, our Bayesian analysis approach 
addresses some of the issues raised in [33]. By imposing 
the Wegscheider conditions on the kinetic parameters 
of a biochemical reaction system, we can effectively con- 
strain these parameters to a thermodynamically feasible 
manifold in the parameter space, thus reducing sloppi- 
ness. Moreover, we can effectively use the RMSE values 
and the D-optimal criterion to determine an appropriate 
experimental design and distinguish those estimated 
values that can be trusted from those that cannot. For 
example, if the RMSE value associated with a kinetic 
parameter is small, then we may trust these values. On 
the other hand, a large RMSE value may indicate high 
uncertainty in the estimated parameter values, which 
may be untrustworthy. As we mentioned before, if a 
sensitivity analysis approach, such as the one proposed 
in [38], indicates that the kinetic parameters associated 
with large RMSE values are influential parameters, then 
we must reduce these RMSE values to an acceptable 
level of uncertainty by adopting a new and more effec- 
tive experimental design approach. On the other hand, 
if these parameters correspond to a non-influential reac- 
tion, then we can accept the estimated values with no 
further consideration, since high uncertainty in the 
exact values of these parameters will not affect the pre- 
dicted concentration dynamics. 

Additional material 



Additional file 1: In this document, we provide theoretical 
details necessary to understand the Bayesian analysis approach 
introduced in the Main text 

Additional file 2: This document contains a detailed description 
of the computational algorithms used for implementing various 
steps of the proposed Bayesian analysis approach 

Additional file 3: In this document, we list the biochemical 
reactions associated with our numerical example and provide 
thermodynamically consistent values for the rate constants as well 
as appropriate values for the initial molecular concentrations 
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